Marketing Analytics Process

Inferential Modeling Workflow

Preprocess Data

So far we have been trying to slowly build intuition about inferential models. Now we want to talk about the data itself. In particular, we often need to preprocess the data so it is appropriate for the model to use. This may also involve creating new variables, a process known as feature engineering. Again, think of preprocessing as data wrangling for models.

We’ve technically already done some preprocessing: data splitting.

# Load packages.
library(tidyverse)
library(tidymodels)

# Import soup data.
soup_data <- read_csv(here::here("Data", "soup_data.csv"))

# Split data.
soup_split <- initial_time_split(soup_data, prop = 0.90)

Preprocessing data is made easier using the recipes package, which breaks common preprocessing tasks into steps. We begin with recipe(), where we identify the model outcome and explanatory variables (a.k.a., predictors) using formula notation.

soup_recipe <- training(soup_split) |> 
  recipe(Sales ~ Any_Disp_Spend + Any_Price_Decr_Spend + Sub_Category)

soup_recipe
## 
## ── Recipe ──────────────────────────────────────────────────────────────────────
## 
## ── Inputs
## Number of variables by role
## outcome:   1
## predictor: 3

Right-Skewed Continuous Data

A common problem is right-skewed continuous data, something to identify when we're doing exploratory data analysis.

Why is right-skewed continuous data common?

# Visualize Sales.
soup_data |> 
  ggplot(aes(x = Sales)) +
  geom_histogram()

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

It’s not just Sales.

# Visualize all continuous variables.
soup_data |> 
  pivot_longer(
    c(Sales, Any_Disp_Spend, Any_Price_Decr_Spend),
    names_to = "var_names",
    values_to = "cont_values"
  ) |> 
  ggplot(aes(x = cont_values)) +
  geom_histogram() + 
  facet_wrap(~ var_names)

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 704 rows containing non-finite outside the scale range
## (`stat_bin()`).

Pause! Why did the plot report that values were removed?

Let’s Check for Missing Values

soup_data |> 
  filter(is.na(Sales) | is.na(Any_Disp_Spend) | is.na(Any_Price_Decr_Spend)) |> 
  select(Sales, Any_Disp_Spend, Any_Price_Decr_Spend)
## # A tibble: 352 × 3
##    Sales Any_Disp_Spend Any_Price_Decr_Spend
##    <dbl>          <dbl>                <dbl>
##  1    33             NA                   NA
##  2  6413             NA                   NA
##  3    30             NA                   NA
##  4  1182             NA                   NA
##  5   415             NA                   NA
##  6   107             NA                   NA
##  7   163             NA                   NA
##  8   669             NA                   NA
##  9   132             NA                   NA
## 10    51             NA                   NA
## # ℹ 342 more rows

Plot without Missing Values

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

The linear model assumes normally distributed errors. This means the distribution of our Y variable should be approximately normal (or at least as close as we can get it)! We can preprocess right-skewed continuous data by using a log() (i.e., logarithmic) transform.

  • Why a log() transform? What is a logarithm?
  • What is the inverse of a logarithm?
# Use the log() transform.
soup_data |> 
  pivot_longer(
    c(Sales, Any_Disp_Spend, Any_Price_Decr_Spend),
    names_to = "var_names",
    values_to = "cont_values"
  ) |> 
  mutate(log_cont_values = log(cont_values + 1)) |> 
  ggplot(aes(x = log_cont_values)) +
  geom_histogram() + 
  facet_wrap(~ var_names, scales = "free")

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 704 rows containing non-finite outside the scale range
## (`stat_bin()`).

We add this step to our recipe using step_log(). We can reference all continuous variables in the model with all_numeric(). The argument offset = 1 says to add 1 before using log() so we can avoid log(0).

soup_recipe <- training(soup_split) |> 
  recipe(Sales ~ Any_Disp_Spend + Any_Price_Decr_Spend + Sub_Category) |> 
  step_log(all_numeric(), offset = 1)

soup_recipe
## 
## ── Recipe ──────────────────────────────────────────────────────────────────────
## 
## ── Inputs
## Number of variables by role
## outcome:   1
## predictor: 3
## 
## ── Operations
## • Log transformation on: all_numeric()

Discrete Data

How many binary variables will we have when we dummy code this discrete variable?

soup_data |> 
  count(Sub_Category)
## # A tibble: 5 × 2
##   Sub_Category       n
##   <chr>          <int>
## 1 CONDENSED SOUP   410
## 2 DRY SOUP        1194
## 3 RAMEN            217
## 4 RTS SOUP        1310
## 5 SOUP IN OUT       61
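To answer this ahead of time, here is a base-R sketch using model.matrix(), which performs the same treatment (dummy) coding that step_dummy() applies: a factor with 5 levels produces 4 binary columns, with the alphabetically first level (CONDENSED SOUP) absorbed into the intercept as the baseline.

```r
# A miniature stand-in for Sub_Category with the same 5 levels.
sub_cat <- factor(c(
  "CONDENSED SOUP", "DRY SOUP", "RAMEN", "RTS SOUP", "SOUP IN OUT"
))

# model.matrix() produces an intercept column plus one dummy per non-baseline level.
dummies <- model.matrix(~ sub_cat)

# 5 levels -> 4 dummy variables; the baseline level gets no column of its own.
ncol(dummies) - 1   # 4
```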

We add this step to our recipe using step_dummy(). We could reference all discrete variables in the model with all_nominal().

soup_recipe <- training(soup_split) |> 
  recipe(Sales ~ Any_Disp_Spend + Any_Price_Decr_Spend + Sub_Category) |> 
  step_log(all_numeric(), offset = 1) |> 
  step_dummy(Sub_Category)

soup_recipe
## 
## ── Recipe ──────────────────────────────────────────────────────────────────────
## 
## ── Inputs
## Number of variables by role
## outcome:   1
## predictor: 3
## 
## ── Operations
## • Log transformation on: all_numeric()
## • Dummy variables from: Sub_Category

Prepare a Recipe

We can build the recipe all at once. Once we’re ready, we can prep() the recipe using the training data. The resulting object is a complete recipe that we can apply to whatever data we’d like.

soup_recipe <- training(soup_split) |> 
  recipe(Sales ~ Any_Disp_Spend + Any_Price_Decr_Spend + Sub_Category) |>
  step_log(all_numeric(), offset = 1) |> 
  step_dummy(Sub_Category) |> 
  prep()

Apply a Recipe

Following the recipe analogy, we apply a prepared recipe to data using bake().

# Apply the recipe to the training data.
baked_soup_training <- soup_recipe |>
  bake(training(soup_split))

# Apply the recipe to the testing data.
baked_soup_testing <- soup_recipe |>
  bake(testing(soup_split))

baked_soup_training
## # A tibble: 2,872 × 7
##    Any_Disp_Spend Any_Price_Decr_Spend Sales Sub_Category_DRY.SOUP
##             <dbl>                <dbl> <dbl>                 <dbl>
##  1           0                    0     6.73                     0
##  2           9.71                10.8  13.1                      0
##  3           0                    0     4.52                     0
##  4           0                    7.58 11.8                      0
##  5           0                    0     3.93                     1
##  6           0                    0    10.2                      1
##  7           0                    0     4.91                     1
##  8           0                    0     0                        1
##  9           2.30                 0     8.74                     1
## 10           0                    0     1.61                     1
## # ℹ 2,862 more rows
## # ℹ 3 more variables: Sub_Category_RAMEN <dbl>, Sub_Category_RTS.SOUP <dbl>,
## #   Sub_Category_SOUP.IN.OUT <dbl>

baked_soup_testing
## # A tibble: 320 × 7
##    Any_Disp_Spend Any_Price_Decr_Spend Sales Sub_Category_DRY.SOUP
##             <dbl>                <dbl> <dbl>                 <dbl>
##  1           0                    0     7.61                     0
##  2           0                    0     3.50                     0
##  3           0                    0     5.12                     0
##  4           5.92                 6.65 11.1                      0
##  5           0                    0     0                        0
##  6           0                    9.33  9.95                     0
##  7           0                    0     7.97                     1
##  8           0                    0     6.00                     1
##  9           5.63                 6.80  6.96                     1
## 10           0                    8.58  9.18                     1
## # ℹ 310 more rows
## # ℹ 3 more variables: Sub_Category_RAMEN <dbl>, Sub_Category_RTS.SOUP <dbl>,
## #   Sub_Category_SOUP.IN.OUT <dbl>

Fit with Preprocessed Training Data

Let’s compare models with and without the preprocessed data.

# Fit a model without preprocessed data.
fit_raw <- linear_reg() |> 
  set_engine("lm") |> 
  fit(
    Sales ~ Any_Disp_Spend + Any_Price_Decr_Spend + Sub_Category, 
    data = training(soup_split)
  )

# Fit a model with preprocessed data.
fit_baked <- linear_reg() |> 
  set_engine("lm") |> 
  fit(
    Sales ~ ., 
    data = baked_soup_training
  )

Evaluate Parameter Estimates

To compare parameter estimates, we can use a horizontal version of the error bar plot.

# Compare parameter estimates.
tidy(fit_baked, conf.int = TRUE) |> 
  ggplot(aes(y = term)) + 
  geom_point(aes(x = estimate)) + 
  geom_errorbarh(aes(xmin = conf.low, xmax = conf.high), height = .1) +
  geom_vline(xintercept = 0, color = "red")

Error Bar plot for ‘Raw’ Data

Error Bar plot for Pre-processed Data

\[\log(\text{Sales} + 1) = 5.42 + 0.12 \times \log(\text{Any\_Disp\_Spend} + 1) + \\ \ldots + 0.18 \times \text{Sub\_Category\_RTS.SOUP} + \ldots\]

How does the log transform change parameter interpretations?

  • y and \(x_1\): A one-unit increase in \(x_1\) leads to a \(\beta_1\)-unit change in y, holding all other variables fixed.
  • log(y) and \(x_1\): A one-unit increase in \(x_1\) leads to approximately a 100 × \(\beta_1\)% change in y, holding all other variables fixed.
  • y and log(\(x_1\)): A 1% increase in \(x_1\) leads to approximately a \(\beta_1\)/100-unit change in y, holding all other variables fixed.
  • log(y) and log(\(x_1\)): A 1% increase in \(x_1\) leads to approximately a \(\beta_1\)% change in y, holding all other variables fixed.
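These interpretations are approximations that hold for small coefficients. We can check the log-log case numerically in base R. This sketch uses an assumed coefficient of b = 0.12 (an illustrative value, not tied to any fitted estimate):

```r
# Log-log model: log(y) = a + b * log(x), so y = exp(a + b * log(x)).
a <- 1
b <- 0.12
y_at <- function(x) exp(a + b * log(x))

# Increase x by 1% and measure the resulting percent change in y.
pct_change <- (y_at(101) / y_at(100) - 1) * 100

# The exact change is (1.01)^b - 1, which is approximately b percent.
pct_change   # ~0.119, close to b = 0.12
```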

Compute RMSE with Preprocessed Testing Data

bind_rows(
  # Compute RMSE without preprocessed testing data.
  predict(fit_raw, new_data = testing(soup_split)) |>
    bind_cols(testing(soup_split)) |>
    rmse(truth = Sales, estimate = .pred),
  # Compute RMSE with preprocessed testing data.
  predict(fit_baked, new_data = baked_soup_testing) |>
    bind_cols(baked_soup_testing) |>
    rmse(truth = Sales, estimate = .pred)
)
## # A tibble: 2 × 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 rmse    standard    41040.  
## 2 rmse    standard        2.36

Wait…Were the Previous RMSE Values Comparable?

The Model Improved…But not THAT Much

bind_rows(
  # Compute RMSE without preprocessed testing data.
  predict(fit_raw, new_data = testing(soup_split)) |>
    bind_cols(testing(soup_split)) |>
    rmse(truth = Sales, estimate = .pred),
  # Compute RMSE for predictions from preprocessed testing data...using the raw Y values.
  predict(fit_baked, new_data = baked_soup_testing) |>
    bind_cols(testing(soup_split)) |>
    # Undo log(x + 1) with exp(.pred) - 1 to return predictions to the raw scale.
    mutate(.pred = exp(.pred) - 1) |>
    rmse(truth = Sales, estimate = .pred)
)
## # A tibble: 2 × 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 rmse    standard      41040.
## 2 rmse    standard      39296.
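The key point: RMSE is only meaningful when truth and estimate are on the same scale. A base-R sketch with made-up numbers shows both sides of this: a log-scale RMSE looks tiny next to a raw-scale RMSE, and a log(x + 1) transform must be inverted with exp(x) - 1 (expm1()) before comparing in dollars.

```r
# RMSE by hand: square root of the mean squared error.
rmse_manual <- function(truth, estimate) sqrt(mean((truth - estimate)^2))

# Made-up raw sales plus predictions on the log1p scale (small log-scale errors).
truth <- c(0, 50, 1200, 46000)
pred_log <- log1p(truth) + c(0.1, -0.2, 0.15, -0.1)

# Log-scale RMSE is tiny, but it is NOT comparable to a raw-scale RMSE.
rmse_manual(log1p(truth), pred_log)

# Back-transform with expm1() to compute RMSE in raw dollars.
rmse_manual(truth, expm1(pred_log))
```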

Apply a Recipe for Prediction

Wait, there’s more! Having all of the preprocessing steps saved as a recipe means we can easily apply it to new_data when using predict() to evaluate counterfactuals. Construct the counterfactual data in the same form as the raw data so that it will work with our recipe.

Recall the Format of the Raw Data

soup_data |>
  select(Any_Disp_Spend, Any_Price_Decr_Spend, Sub_Category, Sales)
## # A tibble: 3,192 × 4
##    Any_Disp_Spend Any_Price_Decr_Spend Sub_Category    Sales
##             <dbl>                <dbl> <chr>           <dbl>
##  1              0                    0 CONDENSED SOUP    839
##  2          16489                48977 CONDENSED SOUP 468607
##  3              0                    0 CONDENSED SOUP     91
##  4              0                 1954 CONDENSED SOUP 139993
##  5              0                    0 DRY SOUP           50
##  6              0                    0 DRY SOUP        27456
##  7              0                    0 DRY SOUP          135
##  8              0                    0 DRY SOUP            0
##  9              9                    0 DRY SOUP         6257
## 10              0                    0 DRY SOUP            4
## # ℹ 3,182 more rows

Construct Counterfactuals

Allocate $10,000 spend between display and price decreases for each sub-category.

# Counterfactual scenarios.
scenarios <- tibble(
  # compute display spend numbers, repeat them 5 times, once for each category
  Any_Disp_Spend = seq(from = 0, to = 10000, by = 2000) |> rep(5),
  # compute price decrease spend numbers, repeat them 5 times, once for each category
  Any_Price_Decr_Spend = seq(from = 10000, to = 0, by = -2000) |> rep(5),
  # character vector of sub-categories, each repeated six times since we have
  # six counterfactual scenarios in each sub-category
  Sub_Category = unique(soup_data$Sub_Category) |> rep(6) |> sort(),
  Sales = 1
)
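As a sanity check on the seq()/rep()/sort() pattern, here is a base-R sketch with hypothetical category names ("A" through "E"): we expect 30 scenarios, every spend split summing to the full $10,000 budget, and six scenarios per category.

```r
# Six spend splits, repeated once per category.
disp  <- rep(seq(from = 0, to = 10000, by = 2000), 5)
price <- rep(seq(from = 10000, to = 0, by = -2000), 5)

# Hypothetical category labels, sorted so each block of six shares a category.
cats <- sort(rep(c("A", "B", "C", "D", "E"), 6))

length(disp)                # 30 scenarios
all(disp + price == 10000)  # TRUE: every split spends the full budget
table(cats)                 # 6 scenarios per category
```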

What’s different about this new_data? How can we interpret each of the rows?

scenarios
## # A tibble: 30 × 4
##    Any_Disp_Spend Any_Price_Decr_Spend Sub_Category   Sales
##             <dbl>                <dbl> <chr>          <dbl>
##  1              0                10000 CONDENSED SOUP     1
##  2           2000                 8000 CONDENSED SOUP     1
##  3           4000                 6000 CONDENSED SOUP     1
##  4           6000                 4000 CONDENSED SOUP     1
##  5           8000                 2000 CONDENSED SOUP     1
##  6          10000                    0 CONDENSED SOUP     1
##  7              0                10000 DRY SOUP           1
##  8           2000                 8000 DRY SOUP           1
##  9           4000                 6000 DRY SOUP           1
## 10           6000                 4000 DRY SOUP           1
## # ℹ 20 more rows

Remove the outcome and apply the recipe before using the new data for predict().

# Apply the recipe to the counterfactual scenarios.
baked_scenarios <- soup_recipe |>
  bake(scenarios) |> 
  select(-Sales)

baked_scenarios
## # A tibble: 30 × 6
##    Any_Disp_Spend Any_Price_Decr_Spend Sub_Category_DRY.SOUP Sub_Category_RAMEN
##             <dbl>                <dbl>                 <dbl>              <dbl>
##  1           0                    9.21                     0                  0
##  2           7.60                 8.99                     0                  0
##  3           8.29                 8.70                     0                  0
##  4           8.70                 8.29                     0                  0
##  5           8.99                 7.60                     0                  0
##  6           9.21                 0                        0                  0
##  7           0                    9.21                     1                  0
##  8           7.60                 8.99                     1                  0
##  9           8.29                 8.70                     1                  0
## 10           8.70                 8.29                     1                  0
## # ℹ 20 more rows
## # ℹ 2 more variables: Sub_Category_RTS.SOUP <dbl>,
## #   Sub_Category_SOUP.IN.OUT <dbl>

Generate point predictions and predictive intervals.

# Predict and bind on prediction intervals.
bind_cols(
  predict(fit_baked, new_data = baked_scenarios),
  predict(fit_baked, new_data = baked_scenarios, type = "pred_int"),
  baked_scenarios
) |> 
  arrange(desc(.pred))
## # A tibble: 30 × 9
##    .pred .pred_lower .pred_upper Any_Disp_Spend Any_Price_Decr_Spend
##    <dbl>       <dbl>       <dbl>          <dbl>                <dbl>
##  1  11.0        7.21        14.9           7.60                 8.99
##  2  11.0        7.13        14.8           8.29                 8.70
##  3  10.8        7.00        14.7           7.60                 8.99
##  4  10.8        6.96        14.6           8.70                 8.29
##  5  10.8        6.93        14.6           8.29                 8.70
##  6  10.6        6.76        14.4           8.70                 8.29
##  7  10.5        6.69        14.4           7.60                 8.99
##  8  10.5        6.63        14.3           8.99                 7.60
##  9  10.5        6.61        14.3           8.29                 8.70
## 10  10.3        6.46        14.1           0                    9.21
## # ℹ 20 more rows
## # ℹ 4 more variables: Sub_Category_DRY.SOUP <dbl>, Sub_Category_RAMEN <dbl>,
## #   Sub_Category_RTS.SOUP <dbl>, Sub_Category_SOUP.IN.OUT <dbl>

Other Preprocessing Steps

Everything a step function does can be wrangled manually, but step functions are often much easier to use. Beyond step_log() and step_dummy(), for example:

  • step_corr() to remove highly correlated explanatory variables (i.e., address multicollinearity).
  • step_normalize() to subtract the mean and divide by the standard deviation.
  • step_discretize() to make a continuous variable discrete.
  • step_num2factor() to quickly turn a numeric variable into a factor.

We also have additional helper functions like all_predictors().
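For intuition, the arithmetic behind step_normalize() is just centering and scaling. A base-R sketch:

```r
x <- c(2, 4, 6, 8, 10)

# Subtract the mean and divide by the standard deviation.
z <- (x - mean(x)) / sd(x)

mean(z)  # ~0
sd(z)    # ~1
```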

Wrapping Up

Summary

  • Walked through preprocessing data using recipes.
  • Demonstrated the log transform and dummy coding.

Next Time

  • What if the outcome isn’t continuous?

Supplementary Material

  • Tidy Modeling with R Chapter 8

Artwork by @allison_horst

Exercise 13

Return again to soup_data and your models used in Exercises 11 and 12.

  1. Check for right-skewed continuous variables.
  2. Prepare a recipe using the necessary steps.
  3. Fit the models again on the preprocessed training data.
  4. Compute the RMSE again using the preprocessed testing data. Is the same model as last time still the best-fitting model?
  5. Use the best-fitting model (based on RMSE) and predict Sales using three possible ways to split $6000 for the promotions budget. As with the training and testing data, be sure to preprocess the new data. What should the client do with their promotional budget? Provide an interpretation using dollars.
  6. Render the Quarto document into Word and upload to Canvas.